# Multimodal Instruction Following

## Dimple 7B
*rp-yu · Apache-2.0 · Image-to-Text · Transformers · English · 422 downloads · 3 likes*

Dimple is the first discrete diffusion multimodal large language model (DMLLM), combining autoregressive and diffusion training paradigms. Trained on the same dataset as LLaVA-NeXT, it outperforms LLaVA-NeXT-7B by 3.9%.

## Qwen2.5-VL-72B-Instruct GGUF
*Mungert · Other · Image-to-Text · English · 2,798 downloads · 5 likes*

Qwen2.5-VL-72B-Instruct is a 72B-parameter multimodal large model for vision-language tasks, capable of understanding images and generating text about them.

## SmolVLM2-500M-Video-Instruct MLX (8-bit, skip-vision)
*mlx-community · Apache-2.0 · Image-to-Text · Transformers · English · 51 downloads · 2 likes*

An MLX-format model converted from SmolVLM2-500M-Video-Instruct, supporting video-to-text tasks.

## Documentcogito
*Daemontatox · Apache-2.0 · Image-to-Text · Transformers · English · 73 downloads · 1 like*

A fine-tuned multimodal model based on unsloth/Llama-3.2-11B-Vision-Instruct, optimized for vision-language tasks with enhanced instruction-following, achieving a 2x training speedup via the Unsloth framework.

## Turkish LLaVA v0.1
*ytu-ce-cosmos · MIT · Image-to-Text · Other · 86 downloads · 10 likes*

A Turkish vision-language model specifically designed for multimodal visual instruction-following tasks, capable of processing both visual (image) and text inputs to understand and execute instructions provided in Turkish.

## Spydaz Web AI LLaVA
*LeroyDyer · Image-to-Text · Transformers · Multilingual · 30 downloads · 1 like*

LLaVA is an open-source multimodal chatbot, built on LLaMA/Vicuna and fine-tuned on GPT-generated multimodal instruction-following data; it serves as a chat/instruction-tuned multimodal counterpart to the base LLM.

## LLaVA 1.5 7B LLaRA D-inBC Aux-B VIMA-80k
*variante · Apache-2.0 · Transformers · 390 downloads · 2 likes*

LLaRA is an open-source vision-language robot policy model, fine-tuned from LLaVA-7b-v1.5 on instruction-following data and auxiliary datasets, intended primarily for robotics research.

## DenseConnector v1.5 8B
*HuanjinYao · Image-to-Text · Transformers · 17 downloads · 7 likes*

DenseConnector is an open-source chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA v1.6 Vicuna 7B
*liuhaotian · Image-to-Text · Transformers · 31.65k downloads · 123 likes*

LLaVA is an open-source multimodal chatbot, built by fine-tuning a large language model on multimodal instruction-following data.

## LLaVA v1.6 34B
*liuhaotian · Apache-2.0 · Image-to-Text · 9,033 downloads · 351 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from a large language model and supporting interaction with both images and text.

## LLaVA Int4
*emon-j · CC · Image-to-Text · Transformers · 40 downloads · 2 likes*

LLaVA is a multimodal large model that achieves general-purpose visual assistant capabilities by connecting a visual encoder to a large language model.

## Japanese Stable VLM
*stabilityai · Other · Image-to-Text · Transformers · Japanese · 122 downloads · 48 likes*

A vision-language instruction-following model capable of generating Japanese descriptions for input images, optionally conditioned on input text (e.g., questions).

## LLaVA v1.5 MLP2x 336px Pretrain Vicuna 7B v1.5
*liuhaotian · Image-to-Text · Transformers · 173 downloads · 17 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA v1.5 7B
*liuhaotian · Image-to-Text · Transformers · 1.4M downloads · 448 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna and supporting image-text interaction.

## SpeechGPT 7B CM
*fnlp · Text-to-Audio · Transformers · 47 downloads · 7 likes*

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, able to perceive and generate multimodal content and to interact via both speech and text.

## SpeechGPT 7B MA
*fnlp · Text-to-Audio · Transformers · 37 downloads · 5 likes*

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, able to perceive and generate multimodal content following human instructions.

## InstructBLIP Vicuna 7B 8-bit
*Mediocreatmybest · Image-to-Text · Transformers · 24 downloads · 3 likes*

InstructBLIP-Vicuna-7B is a vision-language model based on Vicuna-7B, supporting image-to-text tasks.

## LLaVA LLaMA-2 7B Chat Lightning LoRA (Preview)
*liuhaotian · Image-to-Text · Transformers · 251 downloads · 12 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA Lightning 7B Delta v1.1
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 699 downloads · 21 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data.

## LLaVA 7B Delta v0
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 131 downloads · 17 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data and supporting combined visual and language interaction.

## LLaVA 13B Delta v0
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 352 downloads · 221 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data; it is a Transformer-based autoregressive language model.

© 2025 AIbase